{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "dGATMXZE4oAD"
   },
   "source": [
    "# Ungraded Lab: Training a binary classifier with the IMDB Reviews Dataset\n",
    "\n",
    "In this lab, you will be building a sentiment classification model to distinguish between positive and negative movie reviews. You will train it on the [IMDB Reviews](http://ai.stanford.edu/~amaas/data/sentiment/) dataset and visualize the word embeddings generated after training.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "U5IU8moFl-2k"
   },
   "source": [
    "## Imports\n",
    "\n",
    "As usual, you will start by importing the necessary packages."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "h5TyFzPKmBSg"
   },
   "outputs": [],
   "source": [
    "import tensorflow_datasets as tfds\n",
    "import tensorflow as tf\n",
    "import io"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "sWt8QlOGpy1j"
   },
   "source": [
    "## Download the Dataset\n",
    "\n",
    "You will load the dataset via [Tensorflow Datasets](https://www.tensorflow.org/datasets), a collection of prepared datasets for machine learning. If you're running this notebook on your local machine, make sure to have the [`tensorflow-datasets`](https://pypi.org/project/tensorflow-datasets/) package installed before importing it. You can install it via `pip` as shown in the commented cell below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "vqB3GzBorwBh"
   },
   "outputs": [],
   "source": [
    "# Install this package if running on your local machine\n",
    "# !pip install -q tensorflow-datasets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "SpLsMxO2wDrn"
   },
   "source": [
    "The [`tfds.load()`](https://www.tensorflow.org/datasets/api_docs/python/tfds/load) method downloads the dataset into your working directory. You can set the `with_info` parameter to `True` if you want to see the description of the dataset. The `as_supervised` parameter, on the other hand, is set to load the data as `(input, label)` pairs.\n",
    "\n",
    "To ensure smooth operation, the data was pre-downloaded and saved in the data folder. When you have the data already downloaded, you can read it by passing two additional arguments. With ` data_dir=\"./data/\"`, you specify the folder where the data is located (if different than default) and by setting `download=False` you explicitly tell the method to read the data from the folder, rather than downloading it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "_IoM4VFxWpMR"
   },
   "outputs": [],
   "source": [
    "# Load the IMDB Reviews dataset\n",
    "imdb, info = tfds.load(\"imdb_reviews\", with_info=True, as_supervised=True, data_dir=\"./data/\", download=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "J3PEarpKw9_j"
   },
   "outputs": [],
   "source": [
    "# Print information about the dataset\n",
    "print(info)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "kLRAoHil5poj"
   },
   "source": [
    "As you can see in the output above, there is a total of 100,000 examples in the dataset and it is split into `train`, `test` and `unsupervised` sets. For this lab, you will only use `train` and `test` sets because you will need labeled examples to train your model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "5EzNDkdkpvrv"
   },
   "source": [
    "## Split the dataset\n",
    "\n",
    "The `imdb` dataset that you downloaded earlier contains a dictionary pointing to [`tf.data.Dataset`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) objects."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "tA5397cs-EwN"
   },
   "outputs": [],
   "source": [
    "# Print the contents of the dataset\n",
    "print(imdb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "L4oiQ0waBduJ"
   },
   "source": [
    "You can preview the raw format of a few examples by using the [`take()`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#take) method and iterating over it as shown below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "2NgUwTDu7Q1O"
   },
   "outputs": [],
   "source": [
    "# View 4 training examples\n",
    "for example in imdb['train'].take(4):\n",
    "  print(example)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "hOtXX2gxB8pe"
   },
   "source": [
    "You can see that each example is a 2-element tuple of tensors containing the text first, then the label (shown in the `numpy()` property). The next cell below will take all the `train` and `test` sentences and labels into separate lists so you can preprocess the text and feed it to the model later."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "wHQ2Ko0zl7M4"
   },
   "outputs": [],
   "source": [
    "# Get the train and test sets\n",
    "train_dataset, test_dataset = imdb['train'], imdb['test']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ePTIgXj3q8Sg"
   },
   "source": [
    "## Generate Padded Sequences\n",
    "\n",
    "Now you can do the text preprocessing steps you've learned last week. You will convert the strings to integer sequences, then pad them to a uniform length. The parameters are separated into its own code cell below so it will be easy for you to tweak it later if you want."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "lggoZqYUGYgX"
   },
   "outputs": [],
   "source": [
    "# Parameters\n",
    "\n",
    "VOCAB_SIZE = 10000\n",
    "MAX_LENGTH = 120\n",
    "EMBEDDING_DIM = 16\n",
    "PADDING_TYPE = 'pre'\n",
    "TRUNC_TYPE = 'post'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "VTqlsJkMgIQU"
   },
   "source": [
    "An important thing to note here is you should generate the vocabulary based only on the training set. You should not include the test set because that is meant to represent data that the model hasn't seen before. With that, you can expect more unknown tokens (i.e. the value `1`) in the integer sequences of the test data. Also for clarity in demonstrating the transformations, you will first separate the reviews and labels. You will see other ways to implement the data pipeline in the next labs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "YC5d_o8fctHP"
   },
   "outputs": [],
   "source": [
    "# Instantiate the vectorization layer\n",
    "vectorize_layer = tf.keras.layers.TextVectorization(\n",
    "    max_tokens=VOCAB_SIZE,\n",
    ")\n",
    "\n",
    "# Get the string inputs and integer outputs of the training set\n",
    "train_reviews = train_dataset.map(lambda review, label: review)\n",
    "train_labels = train_dataset.map(lambda review, label: label)\n",
    "\n",
    "# Get the string inputs and integer outputs of the test set\n",
    "test_reviews = test_dataset.map(lambda review, label: review)\n",
    "test_labels = test_dataset.map(lambda review, label: label)\n",
    "\n",
    "# Generate the vocabulary based only on the training set\n",
    "vectorize_layer.adapt(train_reviews)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "3ilS5b1Z_38S"
   },
   "source": [
    "You will define a padding function to generate the padded sequences. Note that the `pad_sequences()` function expects an iterable (e.g. list) while the input to this function is a `tf.data.Dataset`. Here's one way to do the conversion:\n",
    "* Put all the elements in a single [ragged batch](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#ragged_batch) (i.e. batch with elements that have different lengths).\n",
    "    * You will need to specify the batch size and it has to match the number of all elements in the dataset. From the output of the dataset info earlier, you know that this should be 25000.\n",
    "    * Instead of specifying a specific number, you can also use the [cardinality()](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#cardinality) method. This computes the number of elements in a `tf.data.Dataset`.\n",
    "* Use the [get_single_element()](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#get_single_element) method on the single batch to output a Tensor.\n",
    "* Convert back to a `tf.data.Dataset`. You'll see why this is needed in the next cell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "IJ7vaovZlRKW"
   },
   "outputs": [],
   "source": [
    "def padding_func(sequences):\n",
    "  '''Generates padded sequences from a tf.data.Dataset'''\n",
    "\n",
    "  # Put all elements in a single ragged batch\n",
    "  sequences = sequences.ragged_batch(batch_size=sequences.cardinality())\n",
    "\n",
    "  # Output a tensor from the single batch\n",
    "  sequences = sequences.get_single_element()\n",
    "\n",
    "  # Pad the sequences\n",
    "  padded_sequences = tf.keras.utils.pad_sequences(sequences.numpy(), \n",
    "                                                  maxlen=MAX_LENGTH, \n",
    "                                                  truncating=TRUNC_TYPE, \n",
    "                                                  padding=PADDING_TYPE\n",
    "                                                 )\n",
    "\n",
    "  # Convert back to a tf.data.Dataset\n",
    "  padded_sequences = tf.data.Dataset.from_tensor_slices(padded_sequences)\n",
    "\n",
    "  return padded_sequences"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "FseKjCoytD31"
   },
   "source": [
    "This is the pipeline to convert the raw string inputs to padded integer sequences:\n",
    "* Use the [map()](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#map) method to pass each string to the `TextVectorization` layer defined earlier.\n",
    "* Use the [apply()](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#apply) method to use the padding function on the entire dataset.\n",
    "\n",
    "The difference between `map()` and `apply()` is the mapping function in `map()` expects its input to be single elements (i.e. element-wise transformations), while the transformation function in `apply()` expects its input to be the entire dataset in the pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "gzPruuz7zgdw"
   },
   "outputs": [],
   "source": [
    "# Apply the layer to the train and test data\n",
    "train_sequences = train_reviews.map(lambda text: vectorize_layer(text)).apply(padding_func)\n",
    "test_sequences = test_reviews.map(lambda text: vectorize_layer(text)).apply(padding_func)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "I5rsqxkie7Ht"
   },
   "source": [
    "You can take a few examples from the results and observe that the raw strings are now converted to padded integer sequences."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Uk3h4RoLDexY"
   },
   "outputs": [],
   "source": [
    "# View 2 training sequences\n",
    "for example in train_sequences.take(2):\n",
    "  print(example)\n",
    "  print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "u5uFax8Fwh2x"
   },
   "source": [
    "You will now re-combine the sequences with the labels to prepare it for training."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "ZCDPu0YiJAWv"
   },
   "outputs": [],
   "source": [
    "train_dataset_vectorized = tf.data.Dataset.zip(train_sequences,train_labels)\n",
    "test_dataset_vectorized = tf.data.Dataset.zip(test_sequences,test_labels)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "e-Asn__NJlp5"
   },
   "outputs": [],
   "source": [
    "# View 2 training sequences and its labels\n",
    "for example in train_dataset_vectorized.take(2):\n",
    "  print(example)\n",
    "  print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "_aUWaTqIiGPZ"
   },
   "source": [
    "Lastly, you will optimize and batch the datasets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "sALj9IHCiwdp"
   },
   "outputs": [],
   "source": [
    "SHUFFLE_BUFFER_SIZE = 1000\n",
    "PREFETCH_BUFFER_SIZE = tf.data.AUTOTUNE\n",
    "BATCH_SIZE = 32\n",
    "\n",
    "# Optimize the datasets for training\n",
    "train_dataset_final = (train_dataset_vectorized\n",
    "                       .cache()\n",
    "                       .shuffle(SHUFFLE_BUFFER_SIZE)\n",
    "                       .prefetch(PREFETCH_BUFFER_SIZE)\n",
    "                       .batch(BATCH_SIZE)\n",
    "                       )\n",
    "\n",
    "test_dataset_final = (test_dataset_vectorized\n",
    "                      .cache()\n",
    "                      .prefetch(PREFETCH_BUFFER_SIZE)\n",
    "                      .batch(BATCH_SIZE)\n",
    "                      )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "N2rCmp7ArGL_"
   },
   "source": [
    "## Build and Compile the Model\n",
    "\n",
    "With the data already preprocessed, you can proceed to building your sentiment classification model. The input will be an [`Embedding`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) layer. The main idea here is to represent each word in your vocabulary with vectors. These vectors have trainable weights so as your neural network learns, words that are most likely to appear in a positive tweet will converge towards similar weights. Similarly, words in negative tweets will be clustered more closely together. You can read more about word embeddings [here](https://www.tensorflow.org/text/guide/word_embeddings).\n",
    "\n",
    "After the `Embedding` layer, you will flatten its output and feed it into a `Dense` layer. You will explore other architectures for these hidden layers in the next labs.\n",
    "\n",
    "The output layer would be a single neuron with a sigmoid activation to distinguish between the 2 classes. As is typical with binary classifiers, you will use the `binary_crossentropy` as your loss function while training."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "5NEpdhb8AxID"
   },
   "outputs": [],
   "source": [
    "# Build the model\n",
    "model = tf.keras.Sequential([\n",
    "    tf.keras.Input(shape=(MAX_LENGTH,)),\n",
    "    tf.keras.layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM),\n",
    "    tf.keras.layers.Flatten(),\n",
    "    tf.keras.layers.Dense(6, activation='relu'),\n",
    "    tf.keras.layers.Dense(1, activation='sigmoid')\n",
    "])\n",
    "\n",
    "# Setup the training parameters\n",
    "model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])\n",
    "\n",
    "# Print the model summary\n",
    "model.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "e8gbnoRdqp8O"
   },
   "source": [
    "## Train the Model\n",
    "\n",
    "Next, of course, is to train your model. With the current settings, you will get near perfect training accuracy after just 5 epochs but the validation accuracy will only be at around 80%. See if you can still improve this by adjusting some of the parameters earlier (e.g. the `VOCAB_SIZE`, number of `Dense` neurons, number of epochs, etc.)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "V5LLrXC-uNX6"
   },
   "outputs": [],
   "source": [
    "NUM_EPOCHS = 5\n",
    "\n",
    "# Train the model\n",
    "model.fit(train_dataset_final, epochs=NUM_EPOCHS, validation_data=test_dataset_final)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "mroDvjEJqwm4"
   },
   "source": [
    "## Visualize Word Embeddings\n",
    "\n",
    "After training, you can visualize the trained weights in the `Embedding` layer to see words that are clustered together. The [Tensorflow Embedding Projector](https://projector.tensorflow.org/) is able to reduce the 16-dimension vectors you defined earlier into fewer components so it can be plotted in the projector. First, you will need to get these weights and you can do that with the cell below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "yAmjJqEyCOF_"
   },
   "outputs": [],
   "source": [
    "# Get the embedding layer from the model (i.e. first layer)\n",
    "embedding_layer = model.layers[0]\n",
    "\n",
    "# Get the weights of the embedding layer\n",
    "embedding_weights = embedding_layer.get_weights()[0]\n",
    "\n",
    "# Print the shape. Expected is (vocab_size, embedding_dim)\n",
    "print(embedding_weights.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "DEuG9AqIuF6i"
   },
   "source": [
    "You will need to generate two files:\n",
    "\n",
    "* `vecs.tsv` - contains the vector weights of each word in the vocabulary\n",
    "* `meta.tsv` - contains the words in the vocabulary"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ykM0Q9ThvszB"
   },
   "source": [
    "You will get the word list from the `TextVectorization` layer you adapted earler, then start the loop to generate the files. You will loop `vocab_size-1` times, skipping the `0` key because it is just for the padding."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "jmB0Uxk0ycP6"
   },
   "outputs": [],
   "source": [
    "# Open writeable files\n",
    "out_v = io.open('vecs.tsv', 'w', encoding='utf-8')\n",
    "out_m = io.open('meta.tsv', 'w', encoding='utf-8')\n",
    "\n",
    "# Get the word list\n",
    "vocabulary = vectorize_layer.get_vocabulary()\n",
    "\n",
    "# Initialize the loop. Start counting at `1` because `0` is just for the padding\n",
    "for word_num in range(1, len(vocabulary)):\n",
    "\n",
    "  # Get the word associated with the current index\n",
    "  word_name = vocabulary[word_num]\n",
    "\n",
    "  # Get the embedding weights associated with the current index\n",
    "  word_embedding = embedding_weights[word_num]\n",
    "\n",
    "  # Write the word name\n",
    "  out_m.write(word_name + \"\\n\")\n",
    "\n",
    "  # Write the word embedding\n",
    "  out_v.write('\\t'.join([str(x) for x in word_embedding]) + \"\\n\")\n",
    "\n",
    "# Close the files\n",
    "out_v.close()\n",
    "out_m.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "3t92Osu3u8Qh"
   },
   "source": [
    "You can find the files in your current working directory and download them. Now you can go to the [Tensorflow Embedding Projector](https://projector.tensorflow.org/) and load the two files you downloaded to see the visualization. You can search for words like `worst` and `fantastic` and see the other words closely located to these."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "4GOiu0WHzMzk"
   },
   "source": [
    "## Wrap Up\n",
    "\n",
    "In this lab, you were able build a simple sentiment classification model and train it on preprocessed text data. In the next lessons, you will revisit the Sarcasm Dataset you used in Week 1 and build a model to train on it."
   ]
  }
 ],
 "metadata": {
  "colab": {
   "private_outputs": true,
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
